This is the chronological compilation of the bits and pieces from 02-, 05-, and 09-.
NOTE: there are four extra samples in this analysis compared to the 09-downsampled version. Looks like it happens in the juvenile samples part, but needs to be fleshed out more.
Hopefully it will work… read in data and libraries
library(tidyverse)
library(CKMRsim) # add this for the index markers function
# genotypes
genos <- readRDS("../new_baseline_data/processed/called_genos_na_explicit.rds") %>%
filter(!str_detect(NMFS_DNA_ID, "N")) # remove Lorne's ambiguous yelloweye samples
labels <- readRDS("../new_baseline_data/processed/label-tibble.rds")
samples <- readRDS("../new_baseline_data/processed/sample-sheet-tibble.rds")
# meta data
meta <- readRDS("../new_baseline_data/processed/meta-data-tibble.rds") %>%
select(1,8,13,22) %>% # just the relevant columns for now
mutate(REPORTED_LIFE_STAGE = ifelse(REPORTED_LIFE_STAGE == "Adult", "ADULT", REPORTED_LIFE_STAGE)) %>% # make the syntax consistent
mutate(REPORTED_LIFE_STAGE = ifelse(REPORTED_LIFE_STAGE == "Juvenile", "JUVENILE", REPORTED_LIFE_STAGE)) %>%
mutate(REPORTED_LIFE_STAGE = ifelse(is.na(REPORTED_LIFE_STAGE), "UNKNOWN", REPORTED_LIFE_STAGE)) # be explict about NAs
# there are thousands of nmfs ids with two metadata entries
# remove duplicates
meta <- meta %>%
distinct()
# make the gtseq run a consistent data type
labels$gtseq_run <- as.integer(labels$gtseq_run)
# how many unique samples at this stage?
genos %>%
select(NMFS_DNA_ID) %>%
unique()
Okay, there are some species names that need to be modified because there is the “S” for Sebastes in front of the name:
# make species labels consistent
labels1 <- labels %>%
mutate(species = ifelse(grepl("S", species), (gsub("S", "", species)), species))
labels1 %>%
group_by(species) %>%
tally() %>%
arrange(n) #%>%
#write.table("csv_outputs/sebastes_meta_species_tally.csv")
Excluding marmoratus, that leaves 58 species (which one wasn’t included in future analyses? Probably brevispinis)
Now join the data
1,898 samples (but these are not wholly unique - some are duplicates from different sequencing runs)
With the species information tacked on, I should be able to organize my data table for self-assignment using rubias.
I’m not sure if there are any of these, but best to leave it in here…
Now, here is a harder operation: if an individual is multiply-genotyped, take the genotype with the highest total read depth.
Seven NMFS_DNA_IDs occur twice - so we narrowed those duplicates down for a total of 1,795 actual samples genotyped.
geno_one_each %>%
group_by(NMFS_DNA_ID) %>%
select(species, NMFS_DNA_ID) %>%
unique() %>%
ungroup() %>%
group_by(species) %>%
tally() %>%
arrange(desc(n))
59 species?! Again, still including marmoratus and brevispinis. I could choose to exclude those earlier?
# remove the genotypes for Scorpaenichthys marmoratus
geno_one_each2 <- geno_one_each %>%
filter(species != "marmoratus")
Total number of samples for manuscript:
geno_one_each2 %>%
ungroup() %>%
select(NMFS_DNA_ID) %>%
unique()
1,793 samples.
species affiliations:
spp_tally <- geno_one_each2 %>%
ungroup() %>%
select(NMFS_DNA_ID, species) %>%
unique() %>%
ungroup() %>%
group_by(species) %>%
tally() %>%
arrange(desc(n)) #%>%
#write_csv("csv_outputs/58species_tally_for_manuscript.csv")
58 species with the following number of samples per species
# read in a list of the 6 loci
to_remove <- read_csv("../data/loci_to_remove.csv")
Parsed with column specification:
cols(
locus = col_character()
)
# only keep the loci that are not those 6
keepers <- geno_one_each2 %>%
anti_join(., to_remove, by = "locus")
# that should leave 90 loci
Now, toss out any individual with fewer than 65 non-missing loci
no_hi_missers <- keepers %>%
group_by(NMFS_DNA_ID, gtseq_run) %>%
filter(sum(!is.na(allele)) >= (65*2))
So, we started with 1795 and after filtering out indivs with fewer than 65 genotyped loci, we were left with 1657 individuals. Those are the ones that we will run through rubias to identify to species.
1793 down to 1657 after removing individuals with too much missing data.
rock2 <- no_hi_missers %>%
dplyr::select(NMFS_DNA_ID, species, locus, allele) %>%
mutate(Chrom = "GTseq") %>%
mutate(Pos = as.integer(factor(locus, levels = unique(locus)))) %>%
dplyr::rename(Locus = locus,
Allele = allele) %>%
dplyr::select(NMFS_DNA_ID, Chrom, Locus, Pos, Allele) %>%
ungroup()
Adding missing grouping variables: `gtseq_run`
Adding missing grouping variables: `gtseq_run`
# get the allele freqs
rock_ckmr_markers <- rock2 %>%
filter(!is.na(Allele)) %>% # it is vital to filter out the NAs at this stage
group_by(Chrom, Locus, Pos, Allele) %>%
dplyr::summarise(counts = n()) %>%
group_by(Locus, Pos) %>%
mutate(Freq = counts / sum(counts)) %>%
dplyr::select(-counts) %>%
mutate(AlleIdx = 1,
LocIdx = 1) %>%
reindex_markers()
# add reference column to prepare data for rubias
dataset <- no_hi_missers %>%
mutate(sample_type = "reference") %>%
rename(collection = species) %>%
rename(indiv = NMFS_DNA_ID) %>%
mutate(repunit = collection) %>%
ungroup() %>%
select(sample_type, repunit, collection, indiv, locus, gene_copy, allele) # reorder the columns
We are going to do this by turning alleles into integers and spreading it and then getting it into the right format to run rubias.
# first make integers of the alleles
alle_idxs <- dataset %>%
#dplyr::select(NMFS_DNA_ID, locus, gene_copy, allele) %>%
group_by(locus) %>%
mutate(alleidx = as.integer(factor(allele, levels = unique(allele)))) %>%
ungroup() %>%
arrange(indiv, locus, alleidx) # rubias can handle NA's, so no need to change them to 0's
# select just the columns to retain and spread the alleles
alle_idx2 <- alle_idxs[,-7]
# figure out what to do about the duplicates:
two_col <- alle_idx2 %>%
group_by(indiv, locus) %>%
mutate(gene_copy = 1:2) %>% # this is to correct the errors in gene copy numbers introduced by the duplicate samples
unite(loc, locus, gene_copy, sep = ".") %>%
spread(loc, alleidx)
From this, swap the identity of the maliger (actually caurinus); remove the helvomaculatus, simulator, and brevispinis.
09-downsampled-baseline.RmdHow many of those are juveniles?
juvies <- new2col %>%
left_join(., meta, by = c("indiv" = "NMFS_DNA_ID")) %>%
filter(REPORTED_LIFE_STAGE == "JUVENILE") %>%
select(collection, indiv) %>%
unique()
juv_samples <- juvies %>%
group_by(collection) %>%
tally() %>%
rename(juv_samples = n)
There are 164 juvenile samples
If I remove samples that are juveniles, what are my numbers per species?
new2col %>%
semi_join(., meta, by = c("indiv" = "NMFS_DNA_ID")) %>%
#anti_join(., juvies, by = "indiv") %>%
group_by(collection) %>%
tally() %>%
rename(total_samples = n) %>%
left_join(., juv_samples, by = "collection") %>%
mutate(remaining_samples = total_samples-juv_samples) %>%
arrange(remaining_samples)
We lose reedi and wilsoni when we exclude juveniles, and drop down to just 2 samples of crameri and five of serriceps.
However, generally including juveniles is a bad idea. So here’s what I’ll do: For any species for which we end up with fewer than 5 samples, I will include juveniles up to that number.
# just juvenile samples for these three species
juvs_to_keep <- new2col %>%
ungroup() %>%
filter(collection %in% c("reedi", "wilsoni", "crameri")) %>%
left_join(., meta, by = c("indiv" = "NMFS_DNA_ID")) %>%
group_by(collection) %>%
filter(REPORTED_LIFE_STAGE == "JUVENILE") %>%
select(1:184)
Now that I have those samples selected, I can remove all juvenile samples and put back those from the juvs_to_keep
Keeping just the juveniles from those three species, we have a total of 1,530 samples from 54 species.
What if I take a maximum of 32 samples per species? (use set.seed to get reproducible results!)
# which species have fewer than 32 samples?
sm_grps <- dataset %>%
group_by(collection) %>%
tally() %>%
arrange(n) %>%
filter(n < 33)
# make a 2-col dataframe with just those groups
sm_d2 <- dataset %>%
semi_join(., sm_grps) %>%
ungroup()
Joining, by = "collection"
# which species have more than 32 samples?
lrg_grps <- dataset %>%
group_by(collection) %>%
tally() %>%
arrange(n) %>%
filter(n > 32)
# downsample those groups
set.seed(5)
down_sam <- dataset %>%
semi_join(., lrg_grps) %>%
group_by(collection) %>%
sample_n(., 32, replace = FALSE) %>%
ungroup()
Joining, by = "collection"
# finish the set.seed
set.seed(NULL)
# and add the data from the other groups back into the dataframe
down_data <- sm_d2 %>%
bind_rows(down_sam) #%>%
# group_by(collection) %>%
# tally() %>%
# arrange(n)
Now do the self-assignment:
assign_down <- self_assign(down_data, gen_start_col = 5)
Summary Statistics:
1010 Individuals in Sample
90 Loci: Plate_1_A01_Sat_GW603857_consensus.1, Plate_1_A11_Sat_GE820299_consensus.1, Plate_2_A09_Sat_EW986980_consensus.1, Plate_2_C08_Sat_EW987116_consensus.1, Plate_2_G06_Sat_EW987118_consensus.1, Plate_3_C03_Sat_GE798118_consensus.1, Plate_4_E10_Sat_EW976030_consensus.1, Plate_4_G06_Sat_EW976181_consensus.1, tag_id_1049.1, tag_id_108.1, tag_id_1184.1, tag_id_1229.1, tag_id_1272.1, tag_id_1366.1, tag_id_1428.1, tag_id_143.1, tag_id_1441.1, tag_id_1449.1, tag_id_1471.1, tag_id_1498.1, tag_id_1558.1, tag_id_1576.1, tag_id_1598.1, tag_id_1604.1, tag_id_1613.1, tag_id_162.1, tag_id_1652.1, tag_id_170.1, tag_id_1708.1, tag_id_1748.1, tag_id_1751.1, tag_id_1762.1, tag_id_179.1, tag_id_1804.1, tag_id_1808.1, tag_id_1810.1, tag_id_1836.1, tag_id_1850.1, tag_id_1880.1, tag_id_1889.1, tag_id_1915.1, tag_id_1950.1, tag_id_1961.1, tag_id_1966.1, tag_id_1982.1, tag_id_1994.1, tag_id_1999.1, tag_id_2008.1, tag_id_2009.1, tag_id_2017.1, tag_id_2062.1, tag_id_2082.1, tag_id_2114.1, tag_id_2134.1, tag_id_2155.1, tag_id_2178.1, tag_id_2182.1, tag_id_220.1, tag_id_2203.1, tag_id_221.1, tag_id_2214.1, tag_id_2237.1, tag_id_2247.1, tag_id_2258.1, tag_id_2301.1, tag_id_2319.1, tag_id_2368.1, tag_id_2499.1, tag_id_250.1, tag_id_2607.1, tag_id_2635.1, tag_id_265.1, tag_id_325.1, tag_id_402.1, tag_id_410.1, tag_id_436.1, tag_id_55.1, tag_id_572.1, tag_id_67.1, tag_id_770.1, tag_id_788.1, tag_id_843.1, tag_id_855.1, tag_id_874.1, tag_id_875.1, tag_id_879.1, tag_id_913.1, tag_id_942.1, tag_id_981.1, tag_id_987.1
54 Reporting Units: aleutianus, alutus, auriculatus, aurora, babcocki, borealis, caurinus, chlorostictus, constellatus, crameri, dallii, diaconus, diploproa, elongatus, emphaeus, ensifer, entomelas, flavidus, hopkinsi, jordani, maliger, melanops, melanostictus, melanostomus, miniatus, moseri, mystinus, nebulosus, nigrocinctus, oculatus, paucispinis, pinniger, polyspinis, proriger, rastrelliger, rosaceus, ruberrimus, rubrivinctus, rufinanus, rufus, saxicola, semicinctus, serriceps, umbrosus, zacentrus, reedi, wilsoni, atrovirens, carnatus, chrysomelas, goodei, levis, ovalis, serranoides
54 Collections: aleutianus, alutus, auriculatus, aurora, babcocki, borealis, caurinus, chlorostictus, constellatus, crameri, dallii, diaconus, diploproa, elongatus, emphaeus, ensifer, entomelas, flavidus, hopkinsi, jordani, maliger, melanops, melanostictus, melanostomus, miniatus, moseri, mystinus, nebulosus, nigrocinctus, oculatus, paucispinis, pinniger, polyspinis, proriger, rastrelliger, rosaceus, ruberrimus, rubrivinctus, rufinanus, rufus, saxicola, semicinctus, serriceps, umbrosus, zacentrus, reedi, wilsoni, atrovirens, carnatus, chrysomelas, goodei, levis, ovalis, serranoides
8.52145214521452% of allelic data identified as missing
Assignment accuracy?
assign_down %>%
ungroup() %>%
filter(scaled_likelihood > 0.95) %>%
mutate(accurate = if_else(repunit == inferred_repunit, TRUE, FALSE)) %>%
filter(accurate == TRUE) #%>%
#filter(collection == "carnatus")
993/1010
[1] 0.9831683
z-scores?
There are 10 samples with z-scores < -3, including 5 hopkinsi samples.
Let’s remove those.
down_data2 <- down_data %>%
ungroup() %>%
anti_join(., z_outliers)
Joining, by = c("repunit", "collection", "indiv")
Removing those leaves us with 1000 samples.
Just to test it out, I’m going to remove these outliers from the baseline and then perform rubias’s mixture assignment with them.
# reference
down_data2 <- down_data2 %>%
ungroup()
# outliers for mixture assignment
zs <- z_outliers %>%
select(indiv)
just_zs <- down_data %>%
ungroup() %>%
right_join(., zs) %>%
mutate(sample_type = "mixture")
Joining, by = "indiv"
Now that the data are separated and formatted, do the mixture assignment
mix_zs <- infer_mixture(reference = down_data2, mixture = just_zs, gen_start_col = 5)
Collating data; compiling reference allele frequencies, etc. time: 0.66 seconds
Computing reference locus specific means and variances for computing mixture z-scores time: 0.51 seconds
Working on mixture collection: flavidus with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.29 seconds
tidying output into a tibble. time: 0.05 seconds
Working on mixture collection: hopkinsi with 5 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.26 seconds
tidying output into a tibble. time: 0.04 seconds
Working on mixture collection: melanops with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.25 seconds
tidying output into a tibble. time: 0.04 seconds
Working on mixture collection: nigrocinctus with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.25 seconds
tidying output into a tibble. time: 0.03 seconds
Working on mixture collection: umbrosus with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.25 seconds
tidying output into a tibble. time: 0.04 seconds
Working on mixture collection: ovalis with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.23 seconds
tidying output into a tibble. time: 0.04 seconds
mix_zs$indiv_posteriors %>%
filter(PofZ > 0.50)
NA
Interesting! One of the hopkinsi comes up as caurinus… but with a z-score of -14! What is it?? The z-scores for the other fish are also pretty extraordinary (none are within 2 standard deviations of the mean).
Take a quick look at the three additional hopkinsi in mixture assignment: (based on the iterative assignment from 09- )
tres_mas <- c("R010370", "R010371", "R010373")
tres_mas <- as.tibble(tres_mas) %>%
rename(indiv = value)
# previous 10 samples
just_zs
# plus tres mas
add_three <- down_data %>%
right_join(., tres_mas) %>%
mutate(sample_type = "mixture")
Joining, by = "indiv"
# combine those into one mixture list
thirteen <- bind_rows(just_zs, add_three)
# remove the tres mas from the reference dataframe
mix_ref <- down_data2 %>%
anti_join(., tres_mas)
Joining, by = "indiv"
# Now do the mixture assignment
mix2 <- infer_mixture(reference = mix_ref, mixture = thirteen, gen_start_col = 5)
Collating data; compiling reference allele frequencies, etc. time: 0.60 seconds
Computing reference locus specific means and variances for computing mixture z-scores time: 0.30 seconds
Working on mixture collection: flavidus with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.27 seconds
tidying output into a tibble. time: 0.04 seconds
Working on mixture collection: hopkinsi with 8 individuals
calculating log-likelihoods of the mixture individuals. time: 0.01 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.25 seconds
tidying output into a tibble. time: 0.03 seconds
Working on mixture collection: melanops with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.23 seconds
tidying output into a tibble. time: 0.04 seconds
Working on mixture collection: nigrocinctus with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.22 seconds
tidying output into a tibble. time: 0.03 seconds
Working on mixture collection: umbrosus with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.24 seconds
tidying output into a tibble. time: 0.04 seconds
Working on mixture collection: ovalis with 1 individuals
calculating log-likelihoods of the mixture individuals. time: 0.00 seconds
performing 100 burn-in and 2000 more sweeps of method "MCMC" time: 0.22 seconds
tidying output into a tibble. time: 0.03 seconds
Take a look at those results
Wild! So three of the hopkinsi assign to caurinus, but with really really low z-scores. I think this is actually interesting enough to include as a small analysis in the methods/results.
In addition, let’s create a single reporting unit for gopher/black-and-yellow
Since all of the misassignments are GBY, make a single reporting unit.
# change the carnatus reporting unit to chrysomelas
gby_repu_2col <- down_data2 %>%
mutate(repunit = ifelse(repunit == "carnatus", "chrysomelas", repunit))
# confirm that the repunit is changed but the collection is not.
gby_repu_2col %>%
filter(collection == "carnatus")
Now try self-assignment with the single reporting unit
gby_repu_assigned <- gby_repu_2col %>%
self_assign(., gen_start_col = 5)
Summary Statistics:
1004 Individuals in Sample
90 Loci: Plate_1_A01_Sat_GW603857_consensus.1, Plate_1_A11_Sat_GE820299_consensus.1, Plate_2_A09_Sat_EW986980_consensus.1, Plate_2_C08_Sat_EW987116_consensus.1, Plate_2_G06_Sat_EW987118_consensus.1, Plate_3_C03_Sat_GE798118_consensus.1, Plate_4_E10_Sat_EW976030_consensus.1, Plate_4_G06_Sat_EW976181_consensus.1, tag_id_1049.1, tag_id_108.1, tag_id_1184.1, tag_id_1229.1, tag_id_1272.1, tag_id_1366.1, tag_id_1428.1, tag_id_143.1, tag_id_1441.1, tag_id_1449.1, tag_id_1471.1, tag_id_1498.1, tag_id_1558.1, tag_id_1576.1, tag_id_1598.1, tag_id_1604.1, tag_id_1613.1, tag_id_162.1, tag_id_1652.1, tag_id_170.1, tag_id_1708.1, tag_id_1748.1, tag_id_1751.1, tag_id_1762.1, tag_id_179.1, tag_id_1804.1, tag_id_1808.1, tag_id_1810.1, tag_id_1836.1, tag_id_1850.1, tag_id_1880.1, tag_id_1889.1, tag_id_1915.1, tag_id_1950.1, tag_id_1961.1, tag_id_1966.1, tag_id_1982.1, tag_id_1994.1, tag_id_1999.1, tag_id_2008.1, tag_id_2009.1, tag_id_2017.1, tag_id_2062.1, tag_id_2082.1, tag_id_2114.1, tag_id_2134.1, tag_id_2155.1, tag_id_2178.1, tag_id_2182.1, tag_id_220.1, tag_id_2203.1, tag_id_221.1, tag_id_2214.1, tag_id_2237.1, tag_id_2247.1, tag_id_2258.1, tag_id_2301.1, tag_id_2319.1, tag_id_2368.1, tag_id_2499.1, tag_id_250.1, tag_id_2607.1, tag_id_2635.1, tag_id_265.1, tag_id_325.1, tag_id_402.1, tag_id_410.1, tag_id_436.1, tag_id_55.1, tag_id_572.1, tag_id_67.1, tag_id_770.1, tag_id_788.1, tag_id_843.1, tag_id_855.1, tag_id_874.1, tag_id_875.1, tag_id_879.1, tag_id_913.1, tag_id_942.1, tag_id_981.1, tag_id_987.1
53 Reporting Units: aleutianus, alutus, auriculatus, aurora, babcocki, borealis, caurinus, chlorostictus, constellatus, crameri, dallii, diaconus, diploproa, elongatus, emphaeus, ensifer, entomelas, flavidus, hopkinsi, jordani, maliger, melanops, melanostictus, melanostomus, miniatus, moseri, mystinus, nebulosus, nigrocinctus, oculatus, paucispinis, pinniger, polyspinis, proriger, rastrelliger, rosaceus, ruberrimus, rubrivinctus, rufinanus, rufus, saxicola, semicinctus, serriceps, umbrosus, zacentrus, reedi, wilsoni, atrovirens, chrysomelas, goodei, levis, ovalis, serranoides
54 Collections: aleutianus, alutus, auriculatus, aurora, babcocki, borealis, caurinus, chlorostictus, constellatus, crameri, dallii, diaconus, diploproa, elongatus, emphaeus, ensifer, entomelas, flavidus, hopkinsi, jordani, maliger, melanops, melanostictus, melanostomus, miniatus, moseri, mystinus, nebulosus, nigrocinctus, oculatus, paucispinis, pinniger, polyspinis, proriger, rastrelliger, rosaceus, ruberrimus, rubrivinctus, rufinanus, rufus, saxicola, semicinctus, serriceps, umbrosus, zacentrus, reedi, wilsoni, atrovirens, carnatus, chrysomelas, goodei, levis, ovalis, serranoides
8.54249667994688% of allelic data identified as missing
Accuracy?
gby_repu_assigned %>%
filter(scaled_likelihood > 0.95) %>%
filter(repunit == inferred_repunit)
996/1004
[1] 0.9920319
gby_repu_assigned %>%
filter(scaled_likelihood > 0.5 & scaled_likelihood < 0.95)
All assignments below the 95% threshold were gopher/black-and-yellow.
What about z-statistics?
gby_repu_assigned %>%
ungroup() %>%
filter(scaled_likelihood > 0.5 & z_score < -3)
outies <- gby_repu_assigned %>%
ungroup() %>%
filter(scaled_likelihood > 0.5 & z_score < -3)
Three additional hopkinsi, as before.
What were the z-scores for these hopkinsi in the previous self-assignment?
assign_down %>%
ungroup() %>%
filter(scaled_likelihood > 0.5 & repunit == "hopkinsi") %>%
arrange(z_score) %>%
select(indiv, z_score) %>%
ungroup()
Not sure what to do about this. Should I have left in the three hopkinsi that only became problematic the second time around? Or can I justify iterating because the z-statistic is based on the mean of the data, so if I have changed the genotypes that are valid hopkinsi, then the mean has changed, and thus, it is unsurprising that additional samples are now outliers.
To be conservative, I think it makes some sense to remove these additional three fish.
down_data3 <- gby_repu_2col %>%
ungroup() %>%
anti_join(., outies, by = "indiv")
Self-assignment once again, to test whether we can actually eliminate the samples with z-statistics < -3
assigned3 <- self_assign(down_data3, gen_start_col = 5)
Summary Statistics:
1001 Individuals in Sample
90 Loci: Plate_1_A01_Sat_GW603857_consensus.1, Plate_1_A11_Sat_GE820299_consensus.1, Plate_2_A09_Sat_EW986980_consensus.1, Plate_2_C08_Sat_EW987116_consensus.1, Plate_2_G06_Sat_EW987118_consensus.1, Plate_3_C03_Sat_GE798118_consensus.1, Plate_4_E10_Sat_EW976030_consensus.1, Plate_4_G06_Sat_EW976181_consensus.1, tag_id_1049.1, tag_id_108.1, tag_id_1184.1, tag_id_1229.1, tag_id_1272.1, tag_id_1366.1, tag_id_1428.1, tag_id_143.1, tag_id_1441.1, tag_id_1449.1, tag_id_1471.1, tag_id_1498.1, tag_id_1558.1, tag_id_1576.1, tag_id_1598.1, tag_id_1604.1, tag_id_1613.1, tag_id_162.1, tag_id_1652.1, tag_id_170.1, tag_id_1708.1, tag_id_1748.1, tag_id_1751.1, tag_id_1762.1, tag_id_179.1, tag_id_1804.1, tag_id_1808.1, tag_id_1810.1, tag_id_1836.1, tag_id_1850.1, tag_id_1880.1, tag_id_1889.1, tag_id_1915.1, tag_id_1950.1, tag_id_1961.1, tag_id_1966.1, tag_id_1982.1, tag_id_1994.1, tag_id_1999.1, tag_id_2008.1, tag_id_2009.1, tag_id_2017.1, tag_id_2062.1, tag_id_2082.1, tag_id_2114.1, tag_id_2134.1, tag_id_2155.1, tag_id_2178.1, tag_id_2182.1, tag_id_220.1, tag_id_2203.1, tag_id_221.1, tag_id_2214.1, tag_id_2237.1, tag_id_2247.1, tag_id_2258.1, tag_id_2301.1, tag_id_2319.1, tag_id_2368.1, tag_id_2499.1, tag_id_250.1, tag_id_2607.1, tag_id_2635.1, tag_id_265.1, tag_id_325.1, tag_id_402.1, tag_id_410.1, tag_id_436.1, tag_id_55.1, tag_id_572.1, tag_id_67.1, tag_id_770.1, tag_id_788.1, tag_id_843.1, tag_id_855.1, tag_id_874.1, tag_id_875.1, tag_id_879.1, tag_id_913.1, tag_id_942.1, tag_id_981.1, tag_id_987.1
53 Reporting Units: aleutianus, alutus, auriculatus, aurora, babcocki, borealis, caurinus, chlorostictus, constellatus, crameri, dallii, diaconus, diploproa, elongatus, emphaeus, ensifer, entomelas, flavidus, hopkinsi, jordani, maliger, melanops, melanostictus, melanostomus, miniatus, moseri, mystinus, nebulosus, nigrocinctus, oculatus, paucispinis, pinniger, polyspinis, proriger, rastrelliger, rosaceus, ruberrimus, rubrivinctus, rufinanus, rufus, saxicola, semicinctus, serriceps, umbrosus, zacentrus, reedi, wilsoni, atrovirens, chrysomelas, goodei, levis, ovalis, serranoides
54 Collections: aleutianus, alutus, auriculatus, aurora, babcocki, borealis, caurinus, chlorostictus, constellatus, crameri, dallii, diaconus, diploproa, elongatus, emphaeus, ensifer, entomelas, flavidus, hopkinsi, jordani, maliger, melanops, melanostictus, melanostomus, miniatus, moseri, mystinus, nebulosus, nigrocinctus, oculatus, paucispinis, pinniger, polyspinis, proriger, rastrelliger, rosaceus, ruberrimus, rubrivinctus, rufinanus, rufus, saxicola, semicinctus, serriceps, umbrosus, zacentrus, reedi, wilsoni, atrovirens, carnatus, chrysomelas, goodei, levis, ovalis, serranoides
8.55699855699856% of allelic data identified as missing
Check the assignments and check the z-scores
assigned3 %>%
filter(scaled_likelihood > 0.5 & z_score < -3)
Cool, so there is a point at which the samples are all falling within the expected distribution of “accurate” assignments.